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ABSTRACT 

This  report  describes  the  research  and  development  of  speaker  localisation  to  locate  the 
position  of  a  person  speaking.  Two  closed-form  localisation  techniques  were  analysed,  the 
first  was  developed  by  Schau  and  Robinson  (1987)  based  on  spherical  intersection  and  the 
other  developed  by  Chan  and  Ho  (1994).  Both  techniques  are  based  on  time  difference  of 
arrival  measurements.  Accordingly  three  time  delay  estimators,  namely  cross-correlation, 
generalised  cross-correlation,  and  an  eigenvalue  decomposition  based  algorithm  were 
analysed.  The  implementation  of  the  algorithms  in  Matlab  and  the  results  from  the  analyses 
are  discussed. 
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Speaker  Localisation  using  Time  Difference  of 

Arrival 


Executive  Summary 

Speaker  localisation  is  defined  as  the  ability  to  automatically  locate  the  position  of  a 
person  speaking  in  a  room.  A  localisation  algorithm  was  implemented  and  tested  for 
use  in  the  Intense  Collaboration  Space  (ICS)  smart  room  environment  at  DSTO, 
Edinburgh. 

Time  Difference  of  Arrival  (TDOA)  was  the  localisation  method  chosen.  TDOA 
localisation  is  based  on  two  parts:  a  time-delay  estimator  and  a  localisation  estimator. 
Two  closed-form  localisation  algorithms,  the  Spherical  Intersection  method  (Schau  and 
Robinson  1987)  and  a  closed-form  method  by  Chan  and  Ho  (1994)  were  chosen  for 
implementation  due  to  their  simplicity  and  speed.  Both  algorithms  were  found  to  be 
robust  against  small  time  delay  errors,  with  the  Chan  and  Ho  estimator  giving  a 
smaller  variance. 

Three  time-delay  estimators,  the  ordinary  cross-correlation,  the  generalised  cross¬ 
correlation  using  phase  transform  (GCC-PHAT)  and  an  eigenvalue  decomposition 
(ED)  algorithm  were  studied.  GCC-PHAT  and  ED  algorithms  were  implemented  in 
MATLAB  and  were  compared  with  MATLAB's  cross-correlation  function. 

Experiments  were  conducted  in  the  ICS  to  obtain  data  for  testing  purposes.  Noise 
signals  of  varying  sound  levels  were  played  out  through  a  computer  speaker  and 
recorded  using  the  wall  microphones  that  exist  in  the  room.  The  recorded  data  also 
included  the  background  noise  and  reflected  sounds  from  the  room. 

The  results  of  the  tests  showed  that  GCC-PHAT  and  ED  methods  worked  much  better 
than  the  ordinary  cross-correlation  in  the  presence  of  existing  background  noise.  GCC- 
PHAT  is  faster  than  ED  in  terms  of  computation  requirement  and  would  therefore  be 
more  suited  for  real-time  use. 
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1.  Introduction 


The  Command,  Control,  Communications  and  Intelligence  (C3I)  Division  in  DSTO  is 
conducting  research  into  networked  smart-room  environments  -  known  as  Livespaces 
(Weber,  2007)  -  for  distributed  collaborative  planning.  Knowledge  of  the  location  of  its 
occupants  would  allow  a  Livespace  to  better  support  them  by:  steering  microphone  arrays 
to  improve  the  quality  of  audio  pickup  for  recording,  communication  and  transcription; 
enhancing  the  separation  -  and  hence  recognition  -  of  speech  from  concurrent  speakers; 
steering  and  zooming  cameras  for  enhanced  face  and  gesture  recognition  during  video 
teleconferencing;  adapting  the  local  visual  and  auditory  presentation  of  information;  and 
providing  explicit  or  implicit  awareness  of  occupant  location  to  participants  at  remote 
sites..  Whilst  acknowledging  that  the  identification  of  occupants  would  afford  additional 
benefits,  this  report  focuses  solely  on  their  localisation. 

Localisation  methods  can  be  broadly  classified  according  to  whether  or  not  the  room 
occupants  contribute  to  their  localisation  by  carrying  or  wearing  equipment  which  assists 
that  localisation.  Such  equipment  might  for  example  include:  a  GPS-like  device  which 
calculates  position  using  signals  from  beacons  of  known  location;  an  emitter  or 
transponder  whose  reception  by  spatially  disparate  sensors  within  the  room  is  used  to 
calculate  position;  or  even  tabs  of  reflective  material  which  enhance  contrast  and  hence 
detection  by  video  processors.  Encumbrance  of  personnel  should  be  minimised  for  a 
variety  of  reasons,  including:  impaired  mobility  due  to  weight,  size  or  tethering  of 
equipment;  time  and  effort  required  to  don,  activate  and  later  shed  the  equipment; 
breakage  and  loss  of  shared  equipment;  the  potential  for  an  individual  to  become 
overburdened  with  a  proliferation  of  encumbrances  such  as  stereo  glasses,  a  personal 
microphone,  identification  tag  and  personal  digital  assistant;  and  the  need  to  power 
portable  devices  which  are  active. 

A  variety  of  sensing  modalities,  including  video,  audio,  infrared  and  ultrasound,  could 
contribute  to  the  detection,  localisation  and  tracking  of  an  unencumbered  occupant  of  a 
room.  Fusion  of  multiple  modalities  would  further  provide  for  more  accurate  and  reliable 
localisation  (see  e.g.  Gatica-Perez  et  al.,  2006,  on  fusing  video  and  audio),  albeit  at  the 
expense  of  additional  sensors  and  processing.  In  order  to  decide  on  a  suitable  mix  of 
modalities  for  a  particular  application/ requirement,  we  need  to  understand  the  physical 
and  technical  limitations  and  capabilities  of  each:  occlusion,  spatial  resolution,  background 
noise  and  clutter,  temporal  discontinuity  of  the  stimulus,  scalability  to  multiple  targets, 
sensor  and  processing  cost. 

This  report  measures  the  accuracy  of  techniques  available  in  localising  and  tracking 
individual  participants  through  their  speech  signals  received  at  multiple  microphones 
whose  position  within  the  room  are  known,  or  can  be  calculated  through  prior 
microphone  calibration  techniques.  In  answering  the  question  as  to  why  use  speech,  it  is 
advocated  that  meeting  environments  are  likely  to  already  have  a  number  of  microphones 
and  hence  may  not  need  additional  sensors  and  a  great  deal  of  processing.  In  such 
environments  signal  processing  equipment  e.g.  beamforming  microphone  arrays  may 
already  be  available  for  co-channel  speaker  separation,  noise  or  echo  cancellation  that 
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would  improve  the  performance  of  speaker  localisation  and  tracking  algorithms.  As  with 
audio  beamforming,  the  complexity  of  measurements  reported  in  this  paper  is  confined  to 
the  current  speaker,  since  this  is  required  in  applications  such  video  zoom  and  pan  for 
VTC.  Nevertheless  intermittent  nature  of  speech  remains  an  inherent  limitation  in  speech 
processing. 

The  bulk  of  the  work  described  in  this  report  was  undertaken  by  Derek  Thai  during  his 
placement  at  DSTO  Edinburgh  under  the  Graduate  Industry-Linked  Entrepreneur  Scheme 
(GILES).  Derek  was  co-supervised  by  Ahmad  Hashemi-Sakhtsari  (DSTO)  and  Matthew 
Trinkle  (University  of  Adelaide),  both  of  whom  made  substantial  contributions  to  the 
report  following  the  completion  of  Derek's  placement.  In  acknowledgement  of  his 
significant  editorial  input,  Tim  Pattison  was  subsequently  invited  to  become  an  author. 

1.1  Aim 

The  aim  of  this  project  is  to  describe  the  design  of  a  real-time  speaker  localisation  system 
for  use  in  smart  workspace  environments  for  collaborative  activities.  This  system  is 
intended  to  be  implemented  in  the  Intense  Collaboration  Space  (ICS)  Laboratory  smart 
room  at  DSTO. 

The  main  factors  determining  the  accuracy  of  the  final  system  are  expected  to  be 
background  noise  in  the  room  and  reverberation.  The  ICS  Laboratory  has  felt-covered 
walls  which  help  reduce  reverberations  to  a  level  that  allows  localisation  results  to  be 
obtained  relatively  consistently  without  requiring  special  techniques  for  reducing  the 
effects  of  reverberation.  However,  the  computer  rack  in  the  ICS  room  appears  to  be 
producing  a  significant  amount  of  background  noise  at  low  frequencies  that  includes  the 
speech  band.  This  noise  source  is  a  potential  problem  as  it  is  also  localised  and  will  not 
necessarily  be  much  reduced  by  cross-correlation  processing  because  it  is  partially 
correlated  similarly  to  the  desired  speech  signal. 


2.  Background 

There  are  many  methods  used  for  speaker  localisation.  These  can  be  loosely  divided  into 
three  categories:  direct  methods,  high-resolution  spectral  estimation-based  methods,  and 
time  difference  of  arrival  methods  (Brandstein  and  Silverman  1997). 

2.1  Direct  Methods 

Direct  methods  derive  the  estimate  directly  from  the  filtered,  weighted  and  summed 
signal  data  received  at  the  sensors.  These  are  steered  beamformer-based  methods,  which 
steer  an  array  of  microphones  to  search  for  peaks  in  source  output  power  (Brandstein  and 
Silverman  1997).  These  direct  methods  can  use  a  priori  knowledge  of  spectral  information 
of  the  signal  and  noise  to  improve  their  performance.  The  localisation  functions  are 
computationally  complex,  and  so  are  impractical  for  real-time  localisation.  Direct  methods 
include  maximum  likelihood  (ML)  estimators  (Bangs  and  Shultheis  1973;  Hahn  and 
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Tretter,  1973;  Hahn  1975;  Carter  1977),  standard  iterative  optimisation  methods  (Wax  and 
Kailath  1983)  and  correlation-based  estimators  (Silverman  and  Kirtman  1992). 

2.2  High-resolution  Spectral  Estimation-based  Methods 

High-resolution  spectral  estimation-based  methods  involve  estimating  the  spatio spectral 
correlation  matrix  from  the  data  received  at  the  sensors  (Brandstein  and  Silverman  1997). 
These  methods  are  less  robust  to  source  and  sensor  modelling  errors  compared  with 
beamforming  methods  and  assume  conditions  that  do  not  exist  in  the  real  world,  such  as 
an  exact  knowledge  of  sensors,  ideal  source  radiators  and  uniform  sensor  channel 
characteristics  (Brandstein  and  Silverman  1997).  Although  the  computational  complexity  is 
not  as  intense  as  beamformer-based  methods,  they  are  still  considerable  and  typically 
increase  as  additional  methods  are  used  to  improve  performance.  Examples  of  these  types 
of  methods  include  autoregressive  (AR)  modelling,  minimum  variance  (MV)  spectral 
estimation  and  eigenvalue  based  techniques  (Haykin  1991;  Johnson  and  Dudgeon  1993). 

2.3  Time  Difference  of  Arrival  Methods 

Time  difference  of  arrival  (TDOA)  estimators  are  based  on  a  two-step  method.  Initially, 
relative  time  delays  for  each  source  are  estimated  from  the  received  data.  These  time 
delays  are  then  used  to  generate  hyperbolic  curves  which  intersect  at  the  source  location 
estimate  as  shown  in  Figure  1. 

TDOA  methods  are  usually  based  on  the  generalised-cross-correlation  (GCC)  to  obtain 
time  delays  (Knapp  and  Carter,  1976),  especially  using  the  Phase  Transform  (GCC-PHAT) 
method  (Macho  et  al.  2005).  These  TDOA  methods  are  the  most  popular  of  the  three  due  to 
their  relatively  low  computational  complexity. 

The  two  stage  process,  although  reducing  the  complexity,  is  sub-optimal  as  a  location 
estimator.  It  is  also  difficult  for  this  method  to  work  with  multi-source  scenarios  since 
algorithms  assume  a  single-source  model  (Brandstein  and  Silverman,  1997). 

Two  main  types  of  TDOA  methods  are  ML  estimators  and  closed-form  estimators.  ML 
methods,  such  as  the  Taylor  Series  method,  are  iterative  and  computational  complexity 
can  be  high,  although  not  as  high  as  direct  methods  or  spectral  estimation-based  methods. 
Closed-form  methods,  such  as  the  Chan  and  Ho  method  (Chan  and  Ho  1994),  attempt  to 
linearise  the  non-linear  estimation  process.  They  are  computationally  fast  and  usually 
suffer  from  little  detriment  in  performance.  Due  to  their  sub-optimal  nature,  closed-form 
estimators  can  be  used  as  an  intermediate  step  by  providing  a  starting  point  for  more 
computationally  complex  methods  (Brandstein  and  Silverman  1997). 
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Figure  1  Intersection  of  two  hyperbolas  created  from  TDOA  estimates  provides  the  position  of  the 
speaker.  A  minimum  of  three  microphones  is  required  for  2D  localisation.  In  this  figure 
microphones  2  and  3  are  referenced  to  microphone  1. 


3.  Project  Overview 

The  work  reported  here  is  part  of  an  ambitious  larger  modular  software  system  that  aims 
to  integrate  real-time  audio  and  video  recording  and  transcription  modules  to  achieve  data 
fusion  for  workflow  monitoring.  Ultimately  an  integrated  environment  for  speaker 
localisation  and  tracking  will  be  set  up  to  collect  meeting  participants'  data  using  audio, 
video  and  text  on  a  DVD.  As  well  as  meetings  and  interviews,  the  system  could  be  tailored 
to  be  used  for  training,  teaching,  analysis,  scriptwriting  and  developing  protocols. 

This  project  focuses  on  using  TDOA  for  speaker  localisation  in  a  smart  room  environment. 
TDOA  localisation  has  the  most  efficient  computational  complexity  and  so  is  more  suited 
to  real-time  applications  than  other  localisation  methods.  The  localisation  system  is 
intended  to  be  implemented  in  the  ICS  Laboratory  at  DSTO.  The  ICS  Laboratory  is  a  smart 
collaboration  space  that  is  part  of  the  LiveSpaces  initiative.  It  employs  a  computer 
controlled  adaptive  environment  with  communication  and  interaction  facilities  such  as 
notebook/  tablet  computers,  screen  projectors,  cameras,  microphones  and  multiple  writing 
surfaces  for  hand  diagrams  and  writings  which  can  be  digitally  captured  (see  Figure  2). 
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Figure  2.  ICS  Laboratory  at  DSTO. 

There  are  three  main  parts  to  this  project.  The  first  part  is  the  research  into  TDOA  methods 
and  their  requirements  (Section  4).  The  second  part  is  the  implementation  of  the  selected 
algorithms  in  MATLAB  for  testing  and  evaluation  purposes  (Section  5).  The  third  part  is 
the  conduct  of  experiments  using  the  implemented  algorithms  to  determine  which  is  more 
effective  for  real-time  speaker  localisation  (Section  6). 

3.1  Background  Research 

Research  was  conducted  on  two  fronts:  background  research  into  TDOA  localisation 
algorithms,  and  research  into  time-delay  estimators  that  are  used  for  TDOA  localisation. 

Two  closed-form  TDOA  methods  were  chosen  for  this  project:  Spherical  Intersection  (SX) 
method  (Schau  and  Robinson  1987)  and  another  method  proposed  by  Chan  and  Ho  (1994) 
(dubbed  the  Ho-Chan  method).  The  main  reason  for  this  choice  was  that  these  methods 
are  very  fast  and  efficient  when  compared  to  other  localisation  methods,  and  hence  most 
useful  in  a  real-time  system.  A  real  time  system  would  be  required  if  a  camera  were  to  be 
used  to  track  a  speaker.  Other  reasons  include  the  straightforwardness  of  the 
implementation  of  this  method  and  the  fact  that  closed-form  estimators  do  not  require  an 
initial  estimate.  Previous  work  done  by  Rice  (2004)  and  Pang1  was  based  on  the  SX 
method. 

Three  time-delay  estimators  were  researched  for  this  project:  the  ordinary  cross¬ 
correlation,  the  generalised  cross-correlation  and  the  eigenvalue  decomposition  algorithm. 


1  Daniel  Pang  researched  the  SX  method  while  conducting  a  project  for  DSTO  based  on  speaker  localisation 
in  2005. 
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3.2  Implementation  of  Algorithms 

The  localisation  system  was  implemented  in  three  parts:  the  localisation  algorithm,  the 
time-delay  estimators,  and  the  iterator. 

The  localisation  algorithm  reads  in  the  relative  time-delays  between  microphones  and  the 
microphone  positions  as  inputs.  Based  on  this  information  it  produces  the  source  position 
estimate.  The  time-delay  estimators  produce  the  relative  time-delays  for  the  localisation 
algorithm.  All  three  algorithms  mentioned  in  section  3.1  were  implemented  for 
comparison. 

The  iterator  is  a  program  that  includes  a  time-delay  estimator  and  the  localisation 
algorithm.  It  repeats  the  localisation  process  multiple  times  to  track  the  sound  source. 
There  are  two  versions  of  the  iterator,  the  first  reads  in  pre-recorded  computer  audio  files 
as  input,  the  second  reads  from  the  computer  audio  I/O  card  so  that  data  received  at  the 
microphones  can  be  used  directly.  This  very  simple  tracking  method  was  implemented 
due  to  time  constraints.  In  future  more  sophisticated  tracking  algorithms  such  as  the 
Kalman  Filter  or  Particle  Filter  should  be  used. 

3.3  Experiments  and  Analysis 

A  set  of  experiments  were  conducted  in  the  ICS  Laboratory.  Computer  generated  noise 
signals  of  different  bandwidths  (see  Table  1)  were  emitted  from  a  computer  speaker  in  the 
room  and  were  recorded  using  microphones  attached  to  the  walls.  The  recorded  audio 
data  included  background  noise  and  reverberation  that  exist  in  the  room,  and  were  used 
to  test  the  effectiveness  of  the  time-delay  estimators. 


4.  Research 


Localisation  techniques  based  on  TDOA  estimates  are  a  two  part  process.  First  the  relative 
time-delays  between  the  receivers  are  estimated  from  the  received  data,  and  then  the 
localisation  algorithm  uses  the  time-delays  to  estimate  the  position  of  the  source.  The 
research  of  the  algorithms  hence  is  divided  into  two  main  sections:  the  localisation 
algorithms  and  the  time-delay  estimation  methods. 

4.1  Localisation  Algorithms 

The  SX  method  by  Schau  and  Robinson  (1987)  and  the  Ho-Chan  method  by  Chan  and  Ho 
(1994)  were  the  two  TDOA  localisation  methods  investigated  in  this  project. 

The  basis  behind  the  two  methods  is  similar  in  that  both  methods  rely  on  minimising  a 
least  squares  (LS)  estimator.  The  SX  method  utilises  the  relationship  in  (1)  between  the 
source  coordinates  (x,  y )  and  distance  r,  when  finding  the  coordinate  estimates: 
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(x,  -xf  +  ( v i  -yf  =r 


(1) 


where  (x„  yi)  are  the  coordinates  of  sensor  i. 

The  Ho-Chan  method  initially  assumes  that  the  three  variables  x,  y  and  r,  are  independent 
and  uses  LS  to  find  an  initial  estimate.  It  then  puts  the  relationship  in  (1)  into  the 
calculation  combined  with  the  initial  estimate  to  produce  the  final  estimate.  This  results  in 
estimates  that  are  optimum  in  a  'Least  Squares'  sense  whereas  the  SX  method  does  not 
(Chan  and  Ho  1994). 

4.1.1  Spherical  Intersection  (SX)  Method 

The  SX  method  uses  an  array  of  at  least  three  sensors  of  which  one  is  selected  to  be  the 
reference  sensor.  Without  any  loss  of  generalisation,  let  sensor  1  be  the  reference  sensor. 
Thus  for  i  =  1  the  equation  (1)  becomes 


(l  -xf  +(xi  -  yf  ~ r\  ■ 


Let  X  = 


(x2  -  x  \) 

(*3  ~Xf 


(w  -yf) 

(y3  ~  Ti ) 

(yN  -  yf). 


K t  =  xf  +  yf  for  i  =  1  to  N 


(2) 


where  N  is  the  number  of  sensors. 


Subtracting  (2)  from  (1)  where  i  =  2  to  N  gives  a  set  of  equations  which  can  be  written  in 
the  form 


x 

y . 


=  CV,  +  D 


(3) 


where 

c  =  (xTxfl 


“r2,l 

”  _  rfx  +  K2-  Kx~ 

-r31 

rn 

- rf ,  +K,-K, 

,  d  =  \xtx )  xt 

3,1  3  1 

\  / 

~  rN,\  _ 

—  rN,\  +  Kn  ~Kl 

Where  q  ■  is  the  TDOA  distance  between  microphones  j  and  i. 


Substituting  (2)  into  (3)  gives  a  quadratic  equation  of  the  form 


arf  +  fii\+  x  =  0  . 


(4) 
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For  a  complete  derivation  see  Appendix  A.l. 

Solving  the  quadratic  in  (4)  produces  two  results  for  ft.  The  correct  result  can  then  be  put 
back  into  (3)  to  solve  for  (x,  y).  However  choosing  the  correct  result  can  be  problem.  There 
are  four  possible  solutions  that  can  occur  in  the  quadratic:  two  negative  solutions,  two 
positive  solutions,  one  positive  and  one  negative  solution  or  two  complex  solutions. 

Pang  researched  this  problem  and  found  that  a  correct  result  exists  when  the  quadratic 
roots  are  either  both  positive  or  one  is  positive  and  one  is  negative.  When  the  roots  are 
both  positive  the  smaller  root  should  be  chosen.  When  there  is  one  positive  and  one 
negative  root  the  positive  root  should  be  chosen  as  ft  is  distance  measure  and  cannot  be 
negative.  When  the  other  two  types  of  roots  occur  (both  negative  or  complex)  it  means 
that  there  is  no  solution  to  the  problem  and  the  source  cannot  be  estimated. 

Simulations  were  conducted  based  on  this  selection  process  to  determine  its  accuracy.  A 
virtual  rectangular  room  of  4  x  4  m  was  used  in  the  simulations  (although  any  sized 
rectangular  room  would  suffice).  Four  microphone  sensors  were  placed  in  the  room  in  two 
configurations:  the  first  configuration  had  a  microphone  in  each  corner  of  the  room;  the 
second  configuration  had  a  microphone  in  the  middle  of  each  wall. 

Figures  3-7  show  the  results  of  configuration  1.  The  microphones  are  indicated  by  the  red 
and  black  dots;  the  red  dot  is  the  reference  microphone.  Figure  3  shows  the  coordinates 
found  when  there  is  one  positive  and  one  negative  root.  Figure  4  shows  the  coordinates 
found  when  there  are  two  positive  roots.  Figure  5  shows  the  combined  coordinates. 
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Positive  and  negative  roots 
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Figure  3.  Microphone  configuration  1;  the  shaded  area  indicates  where  the  quadratic  has  one 
positive  and  one  negative  root. 
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Two  positive  roots 


Figure  4.  Microphone  configuration  1;  the  shaded  area  indicates  where  the  quadratic  has  two 
positive  roots. 


Combining  the  two  results 


X-co-ordinate  (m) 


Figure  5.  Microphone  configuration  1;  combining  the  areas  shown  in  Figures  1  and  2.  The 
pink/magenta  area  is  where  one  positive  root  and  one  negative  root  occurred;  the  blue 
area  is  where  two  positive  roots  occurred. 


As  can  be  seen  in  Figure  5  the  whole  room  can  be  covered  by  this  selection  process.  There 
is  a  distinct  line  between  the  areas  where  two  positive  roots  occur  and  where  one  positive 


9 


DSTO-TR-2126 


root  and  one  negative  root  occur.  To  see  whether  there  are  irregularities  along  that  line, 
standard  deviation  plots  were  created  using  the  same  configuration. 

Note  how  there  is  no  distinct  break  line  that  occurs  in  Figure  5.  This  indicates  that  the 
selection  process  is  correct  and  the  solution  found  is  unique.  Figures  6  and  7  show  the 
standard  deviation  plots  of  the  estimates  when  random  white  Gaussian  noise  with  a  fixed 
variance  was  added  to  the  TDOA  estimates. 

The  transition  between  two  positive  real  roots,  and  one  positive  plus  negative  root,  was 
studied  in  more  detail  as  this  situation  gives  rise  to  a  potential  discontinuity  in  ft. 
However  it  was  found  that  the  smaller  of  the  two  positive  roots  remained  continuous  at 
the  transition,  while  the  larger  positive  root  was  discontinues  and  became  negative  at  the 
transition.  As  the  smaller  of  positive  root  is  chosen  as  the  solution  for  ft,  it  was  continuous. 


TDOA  distance  error 


0  0.5  1  1.5  2  2.5  3  3.5  4 

X-co-ordinate  (m) 


Figure  6.  Microphone  configuration  1;  plot  of  standard  deviation  of  coordinates  over  500  samples 
when  TDOA  values  had  additive  noise  of  variance  0.001  m2.  The  colour  bar  on  the  right 
side  shows  the  relative  colour  of  the  standard  deviation  between  0  and  1  m. 
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TDOA  distance  error 


X-co-ordinate  (m) 


Figure  7.  Microphone  configuration  1;  plot  of  standard  deviation  of  coordinates  over  500  samples 
when  the  TDOA  values  had  additive  noise  of  variance  0.1  m2.  The  colour  bar  on  the 
right  side  shows  the  relative  colour  of  the  standard  deviation  between  0  and  1  m. 


Figures  8-12  show  the  same  as  Figures  3-7  but  for  the  second  configuration  of 
microphones. 
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Figure  8.  Microphone  configuration  2;  the  shaded  area  indicates  where  the  quadratic  has  one 
positive  root  and  one  negative  root. 
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Two  positive  roots 


Figure  9.  Microphone  configuration  2;  the  shaded  area  indicates  where  the  quadratic  has  two 
positive  roots. 
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Combining  the  two  results 
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Figure  10.  Microphone  configuration  2;  combining  the  areas  shozvn  in  Figures  8  and  9.  The 
pink/magenta  area  is  where  one  positive  root  and  one  negative  root  occurred;  the  blue 
area  is  where  two  positive  roots  occurred. 
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Figure  11.  Microphone  configuration  2;  plot  of  standard  deviation  of  coordinates  over  500  samples 
when  the  TDOA  values  had  additive  noise  of  variance  0.001  m2.  The  colour  bar  on  the 
right  side  shows  the  relative  colour  of  the  standard  deviation  between  0  and  1  m. 


Figure  12.  Microphone  configuration  2;  plot  of  standard  deviation  of  coordinates  over  500  samples 
when  the  TDOA  values  had  additive  noise  of  variance  0.1  m2.  The  colour  bar  on  the 
right  side  shows  the  relative  colour  of  the  standard  deviation  between  0  and  1  m. 
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Note  that  the  array  configuration  in  Figure  12  results  in  greater  position  variance  than  the 
configuration  in  Figure  7  especially  in  the  corners  of  the  room.  This  is  expected  to  be  due 
to  the  microphones  being  positioned  further  apart  in  Figure  7,  thus  enclosing  a  larger  area 
and  giving  a  better  Geometric  Dilution  of  Precision  (GDOP)  over  a  larger  area. 

4.1.2  Ho-Chan  Method 

This  section  gives  the  main  steps  and  motivation  of  the  Ho-Chan  method.  For  a  more 
detailed  derivation  as  well  as  definition  of  all  terms  used  in  this  discussion  see  Appendix 
A. 


The  Ho-Chan  method  starts  with  the  same  equations  as  the  SX  method  and  arrives  at  the 
same  result  as  (3).  By  assuming  that  the  variables  x,  y  and  r\  are  independent,  the  error 
vector  ip  is  defined  as 


h  -  GaZa°  =  ip 


(5) 


where 


h  = 


~rl-K2+K;. 

*2,1 

y  2,i 

r2,l 

r\\  ~K3  +Kx 

Ga=-[X 

4- 

*3,1 

y  3,i 

r3,l 

rN,\  ~  Kn  +  Kl  J 

_XN.  1 

yN, ! 

W,  1. 

r  o 

X 

¥ 2 

Za°  = 

0 

y 

/  w  = 

¥  3 

ri° 

¥n_ 

za°  are  the  true  values  of  za 


x 


y 


which  correspond  to  the  estimated  source  location  (x,y) 


1/iJ 

and  the  estimated  distance  between  the  source  and  the  reference  microphone  n. 

When  the  true  TDOA  distances  are  available  ip  =  0.  In  practice  this  is  not  the  case.  The  Ho- 
Chan  method  involves  a  Least  Square  (LS)  calculation  to  minimise  ip: 


Za  =  (GaT1P-1Ga)-1GaTTP-1h.  (6) 

W  is  the  covariance  matrix  of  ip  and  is  given  by 

W  =  BQB  (7) 
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where 


r2  0  ...  0 

0  r3°  ; 

;  o 

0  •••  0  r“ 


Q  is  the  covariance  matrix  of  the  TDOA  distances. 


In  practice  W  is  not  known  since  B  contains  the  true  TDOA  distances.  The  Ho-Chan 
method  approximates  (6)  as 


Za  »  (GaTQ'1Ga)_1GaTQ'1h. 


(8) 


B  is  estimated  from  the  initial  estimates  of  za  from  (8)  and  is  put  into  (6)  to  achieve  a  better 
estimate  of  za.  (6)  can  be  iterated  multiple  times  to  produce  an  even  better  estimate,  but 
simulations  done  by  Chan  and  Ho  (1994)  show  that  one  iteration  is  sufficient  to  produce 
accurate  results. 


The  covariance  of  za  is  given  by 

COv(Za)  =  (Ga^lGa0)-1  (9) 

where  Ga°  is  identical  to  Ga  except  that  it  uses  the  exact  values  of  j'2,1  ..  nsf.i  rather  than  the 
estimated  ones. 

Let  the  elements  of  za  be  expressed  as 


Z*,l 

X° 

+  ei 

Za  = 

= 

V° 

+  e2 

_^«,3  J 

r° 

u 

where  e\,  ei  and  <23  are  estimation  errors  of  za. 


Chan  and  Ho  define  another  set  of  equations: 


h'  -  Ga'za'°  =  ip’ 
where 


(Za,  1  +Xl)2 

“l 

0" 

h'  = 

(Za,2  +Tl)2 

,Ga'  = 

0 

1 

2 

Z 

a,  3 

1 

1 

Za  = 


(x-Xi)2 


(10) 


(ii) 


v 


V 1 
¥2 

Vi 


ip’  is  the  vector  of  inaccuracies  in  za.  The  covariance  matrix  B*'  of  ip’  is 


15 


DSTO-TR-2126 


W  =  E[i p'ip'T]  =  4B'cov(za)B' 
where 


(12) 


0 

0 


B'  can  be  approximated  by  using  the  values  in  za.  A  second  LS  calculation  is  used  to  find 

Za' 


Za'  =  (Ga'^'-lGa^lGa'^P'-lh'  (13) 

za'  is  an  estimate  of  (x  -  xi )2  and  (y  -  yi)2  and  so  a  simple  conversion  will  produce  x  and  y 
estimates: 


±J  z 


Xi 

w 


(14) 


Since  there  are  two  possible  solutions  for  a  square  root,  the  correct  solution  is  the  one  that 
lies  in  the  region  of  interest.  This  can  be  easily  determined  by  looking  at  the  sign  of  the 
initial  za  estimates.  If  one  of  the  coordinates  is  close  to  zero  the  square  root  may  become 
imaginary.  In  this  case  the  imaginary  component  should  be  set  to  zero  (Chan  and  Ho 
1994). 

To  summarise,  (8)  is  used  to  find  an  initial  estimate  of  za  which  is  used  to  estimate  B.  (6)  is 
then  used  to  find  more  accurate  estimates  of  za  and  can  be  repeated  for  multiple  iterations. 
(13)  is  then  used  to  put  the  relationship  in  (1)  between  x,  y  and  n  into  the  calculation.  The 
final  coordinate  estimate  zp  is  found  using  (14). 

4.1.3  Comparison  of  Localisation  Algorithms 

Comparisons  were  made  between  the  SX  and  Ho-Chan  methods  and  the  Cramer-Rao 
Lower  Bound  (CRLB)  that  gives  the  achievable  minimum  error.  The  CRLB  sets  a  lower 
bound  on  the  variance  of  any  unbiased  estimator.  For  derivation  of  the  CRLB  for  TDOA 
localisation  see  Appendix  A.3. 

Simulations  were  done  using  two  virtual  rectangular  rooms  of  size  4x4  m  and  8x4  m.  Six 
microphone  sensors  were  placed  on  two  opposite  walls  of  the  room  as  shown  in  Figure  13; 
the  microphones  are  indicated  by  the  red  and  black  dots.  White  Gaussian  noise  of 
variances  0.01  m2  and  0.0001  m2  were  added  to  the  TDOA  values  calculated  from  each 
point  in  the  room  to  generate  standard  deviation  plots  for  the  CRLB,  SX  method  and  Ho- 
Chan  method.  These  plots  are  shown  in  Figures  21  to  64  in  Appendix  B. 
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Figures  13-15  show  the  standard  deviation  plots  for  one  case  where  the  room  size  is  8x4  m 
and  the  noise  variance  is  0.0001  m2.  The  red  dot  is  the  reference  microphone. 
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Figure  13.  Standard  deviation  of  CRLB  -position  error  when  TDOA  distances  had  additive  noise  of 
variance  0.0001  m2;  reference  sensor  in  the  middle. 


SX  Source  Position  Error  (ml 
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Figure  14.  Standard  deviation  of  position  error  using  the  SX  method  over  500  samples  when 
TDOA  distances  had  additive  noise  of  variance  0.0001  m2;  reference  sensor  in  the 

middle. 


Ho  Chan  Source  Position  Error  (ml 
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TDOA  distance  error;  Ho-Chan 
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Figure  15.  Standard  deviation  of -position  error  using  Ho-Chan  method  over  500  samples  when 
TDOA  distances  had  additive  noise  of  variance  0.0001  m2;  reference  sensor  in  the 

middle. 

By  observation  the  Ho-Chan  results  resemble  the  CRLB  much  more  closely  than  the  SX 
results.  To  see  this  more  accurately  the  CRLB  was  subtracted  from  the  standard  deviations 
of  SX  and  Ho-Chan  methods  and  are  shown  in  Figures  16  and  17. 

Note  that  Figures  16  and  17  are  on  a  scale  10  times  smaller  than  Figures  13-15.  It  is  clear 
that  the  Ho-Chan  method  produce  very  accurate  estimates  provided  that  the  input  errors 
are  small  (in  this  case  the  error  variance  is  0.0001  m2). 
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Figure  16.  Plot  of  CRLB  subtracted  from  SX  standard  deviations. 
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Hn  Chan  -  CRI.R  Source  Position  Prror  fml 
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Figure  17.  Plot  ofCRLB  subtracted  from  Ho-Chan  standard  deviations. 

In  spite  of  the  superior  performance  of  the  Ho-Chan  over  the  SX  method,  Chan  and  Ho 
(1994)  proposed  that  for  the  case  of  three  sensors  the  SX  method  should  be  used.  In 
simulations  the  Ho-Chan  method  does  not  work  with  three  sensors  due  to  singularities  in 
the  calculations.  When  there  are  four  sensors  it  was  found  that  the  Ho-Chan  method  had 
very  bad  results  in  areas  inside  the  microphone  array  as  shown  in  Figure  18.  This  limits  its 
use  to  when  there  are  five  or  more  sensors. 


Figure  18.  Standard  deviation  of  source  position  using  the  Ho-Chan  method  with  four  sensors. 
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4.2  Time-delay  Estimators 

The  Ho-Chan  method  worked  well  with  error  variances  close  to  that  of  the  CRLB  provided 
that  the  time  delays  used  for  calculations  are  accurate.  The  main  problem  is  with  obtaining 
these  accurate  time  delays.  Room  acoustics  such  as  background  noise  and  reverberation 
are  the  main  contributors  to  inaccurate  time  delay  estimates.  The  background  noise  in  the 
ICS  Lab  is  high  due  to  its  computer  servers  residing  in  the  same  room.  Reverberation  from 
the  walls  has  been  reduced  because  60%  of  the  walls  have  a  felt  covering.  However  it  still 
exists  due  to  the  uncovered  walls  and  ceiling  and  from  the  reflective  white-board  surfaces 
(existing  along  some  walls  and  covering  the  tables). 

The  ordinary  cross-correlation  is  a  common  method  used  to  obtain  time-delays  between 
signals  (Knapp  and  Carter  1976).  Two  alternative  time-delay  estimation  methods  have 
been  tested  against  the  ordinary  cross-correlation  method:  the  generalised  cross¬ 
correlation  using  phase  transform  (GCC-PHAT)  method  (Knapp  and  Carter  1976)  and  the 
eigenvalue  decomposition  (ED)  method  (Benesty  2000). 

4.2.1  Cross-correlation 

Cross-correlation  of  two  signals  is  the  convolution  of  one  with  the  complex  conjugate  of 
the  other.  The  cross-correlation  Rxi,xi(t)  of  two  signals  xi  and  X2  is  shown  in  (15): 

Rxl.xlit)  =  Xl+  X2  =  x1  (~t)  0  x2  (t)  (15) 

where  ★  denotes  cross-correlation;  0  denotes  convolution  defined  by 

x1(f)0x2(O=[  xx(r)x2(t  -  r)dr  .  (16) 

J— oo 


Cross-correlation  determines  how  correlated  two  signals  are.  Two  signals  that  are  identical 
but  time-shifted  by  a  certain  time-delay  will  have  a  cross-correlation  function  with  a 
strong  maximum  peak  at  that  time-delay  point  (Stein,  1981).  However,  when  reverberation 
distorts  the  input  signals,  each  reverberation  causes  an  extra  peak  in  the  cross-correlation 
function  as  sidelobes.  When  the  input  signal  is  very  periodic  the  cross-correlation  function 
is  also  very  periodic  in  form.  Each  of  the  periodic  peaks  would  have  its  own  sidelobes 
which  sum  with  the  existing  peaks.  This  makes  it  difficult  to  determine  which  peak  is  the 
central  time-delay  peak  and  which  are  just  reverberation  sidelobes. 

Background  noise  that  exists  in  the  signals  also  affects  the  plots  in  two  ways  depending  on 
whether  the  noise  is  random  or  whether  there  is  a  localised  noise  source.  Spatially  random 
noise  would  create  random  errors  in  the  plots  whereas  localised  noise  would  produce 
extra  correlation  peaks  relative  to  the  time  delay  of  the  noise  source  location.  Spatially 
random  noise  may  arise  from  a  large  number  of  distributed  noise  sources  such  as 
computers;  a  large  number  of  reflections  will  also  tend  to  de-correlate  the  noise  sources. 
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4.2.2  Generalised  Cross-correlation 

The  generalised  cross-correlation  can  be  explained  using  the  following  equations.  The 
original  cross-correlation  Rx i,xi(t)  is  related  to  the  cross  power  spectral  density  function 
Gx\xi{f)  by  the  Fourier  transform  relationship 

R,U2(t (17) 

The  generalised  cross-correlation  has  an  additional  weighting  value  in  the 

frequency  domain,  before  the  integration/  summation  step  and  is  of  the  form 

■  f  ■  (18) 


The  weighting  value  is  chosen  from  a  number  of  different  proposed  methods,  each  with 
their  advantages  and  disadvantages.  The  ordinary  cross-correlation  method  would  have 
(//(  /  )  =  1.  The  method  chosen  for  this  project  is  the  Phase  Transform  (GCC-PHAT). 


The  GCC-PHAT  uses  a  weighting  of 


K/)  = 


(19) 


which  produces 


Rgxl,xl{t)  = 


Gx\,x2  (/)  ej20d, 

G,U2(/)| 


(20) 


Essentially  the  GCC-PHAT  normalises  the  resulting  cross  spectral  power  density  of  the 
two  signals  to  a  constant  value  which  effectively  pre-whitens  the  cross-correlation 
function.  This  pre-whitening  equalises  the  amplitude  of  the  signals  across  the  frequency 
band  leaving  only  the  phase  information.  This  helps  to  reduce  the  effects  of  reverberation 
on  the  accuracy  of  the  TDOA  estimates  (Knapp  and  Carter  1976). 

4.2.3  Eigenvalue  Decomposition  (ED)  Method 

The  ED  method  is  based  on  estimating  the  impulse  response  between  the  two  signals.  A 
model  of  the  signals  received  Xi(n),  i  =  1,  2,  can  be  expressed  as 

xi(n)  =  gi®s(n)  +  bi(n)  (21) 


where  s;(n)  is  the  source,  g,  is  the  discrete  time  impulse  response  of  the  channel  between 
the  source  and  receiver  and  bi(n)  is  additive  noise. 
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Simplifying  (14)  by  removing  additive  noise,  we  have 

*1 00  =  gi  ®  S(n) ,  x2  (n)  =  g2  0  j(/i)  (22) 

therefore  assuming  that  the  system  (room)  is  linear,  time  invariant  and  noise  free, 

*1 00  ®  g2  =  Si  ®  $00  ®  g2  =  gi  ®  g2  ®  s(n)  =  g,  ®  x2  («) .  (23) 

From  this  we  have  the  relation 


xir(n)g2  =  X2T(n)gi 


(24) 


where  Xi  and  gi  are  vectors  of  the  signals  received  and  corresponding  impulse  responses 


respectively,  where  xi(n)  = 


Xi(n-\) 

Xj  (n-N  + 1) 


are  the  samples  along  the  N  tap  filter  used  to 


model  the  impulse  response. 

The  covariance  matrix  of  the  two  signals  is 


R  = 


R*Lvl 

R.y2x1 


where 


^x2x2 


(25) 


R xi  Xj  =  E{xi(n)xjT(n)}r  i,j  =  1,2. 


(26) 


Let 


L-SiJ 

From  (24)  and  (25)  we  have 

n  ^xlxl§2  ^xlx2§l 

Ru  = 

_^a2x1§2  _  ^*2*281. 
E{Xjxf}g2 -ElXjX^g, 

_E{x2xf}g2  -  E{x2x2 }gj 
E{x1xfg2}-E{x1x2g1} 
_E{x2x[g2}  -  E{x2x2gj} 


(27) 
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(28) 

meaning  that  vector  u  is  the  eigenvector  of  R  corresponding  to  an  eigenvalue  of  0  (Benesty 

2000). 

u  is  the  estimated  impulse  response  from  the  source  to  a  microphone.  It  is  calculated  using 
a  Least  Mean  Square  (LMS)  algorithm  with  the  following  equations: 

e(n)  =  u  T(n)\(n)  (29) 


u(«  + 1) 


u(n)  -  jue(n)x(n) 
u  (n)  -  jue(n)x(n)  || 


(30) 


where  e(n)  is  the  error  signal  and  }i  is  the  step  size. 

The  aim  is  not  to  accurately  estimate  the  impulse  responses  gi  and  gz  but  to  find  the  time- 
delay.  It  is  sufficient  to  just  detect  the  two  direct  paths.  By  initialising  a  tap  equal  to  1  in 
the  middle  of  the  first  half  of  u  (i.e.  in  the  middle  of  g2)  and  having  everything  else  zero 
that  particular  peak  will  always  be  dominant.  A  "mirror"  effect  will  appear  in  the  second 
half  of  u  (i.e.  in  gi)  in  the  form  of  a  negative  peak  that  will  dominate.  The  relative  position 
of  this  peak  compared  with  the  original  peak  in  the  first  half  of  u  is  the  time-delay 
(Benesty  2000). 

4.2.4  Theoretical  Performance  of  Time-delay  Estimation 

A  method  for  calculating  the  theoretical  performance  of  time-delay  estimation  is  given  by 
Ardoino,  Capriati  and  Zaccaron  (2006)  based  on  cross-correlation.  The  method  assumes  an 
ideal  environment  with  ideal  signals.  Hence  it  is  useful  as  a  lower  bound  estimator. 

Using  the  notation  in  Section  4.2.1  Rx i,.u(t)  is  the  cross-correlation  of  signal  x\  and  itself 
(autocorrelation  of  Xi).  The  magnitude  of  the  cross-correlation  function  |  Rxi,u(t)  |  is 
approximated  as 


P„  1 


(31) 


where  Px  1  is  the  signal  power  of  xi  and  Beq  is  termed  the  "Equivalent  Bandwidth"  given  by 


B 


eq 


2 


DA(/V/, 
j>s(/y/ ' 


(32) 
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Ps(f)  is  the  signal  Power  Spectral  Density  (PSD)  function  of  xi  (Ardoino  et  al.  2006). 

Let  B  =  1/Tcbe  the  sampling  frequency  where  Tc  is  the  sampling  time.  The  signal  is  valid 
within  the  Nyquist  frequency  band  of  -B/2  to  +B/2  and  zero  everywhere  else. 

Assuming  that  the  signal  power  remains  constant  (32)  now  becomes 


PAfYiyfdf 
Ps(f  )-f|  1  df 


1 

3 


f 


V 


Bi  B ^ 
8  +  8  , 

~B  B 

- 1 - 

2  2 


U' 


(33) 


The  analysis  of  the  cross-correlation  function  is  based  on  taking  three  points  of  the 
function  at  time  lags  -MTC,  0  and  +MTC  where  M  is  a  chosen  number  of  samples  either  side 
of  the  maximum  peak.  The  peak  is  assumed  to  always  lie  in  the  time-span  of  2 MTC .  A 
parabola  is  fitted  to  these  three  points  and  the  maximum  of  the  parabola  corresponds  to 
the  time-delay  estimate  (Ardoino  et  al.  2006). 


The  standard  deviation  of  the  time-delay  estimates  <?TDOA  is  given  by 


a 


1 

/■W  -  I 

1 

r  *  ' 

2 

| 

SNR2 

yj 

SNR 

(34) 


where  N  is  the  length  of  the  signal  in  number  of  samples  and  SNR  is  the  Signal-to-Noise 
Ratio. 


Given  (33),  (34)  can  be  simplified  as 


®TDOA 


Vh2 

y[N -In-B 


'  12  N 

K%7t2  -M2  J 


+ 


2 

SNR  ‘ 


(35) 


&TDOA  gives  a  standard  deviation  of  time  in  seconds.  Setting  B  to  1  will  give  a  standard 
deviation  in  number  of  samples. 

The  standard  deviation  was  calculated  for  each  of  the  SNR's  and  integration  periods  used 
in  the  experiments  in  section  6  of  this  report.  It  was  found  that  in  each  case  the  standard 
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deviation  was  less  than  0.2  of  a  sample.  The  actual  standard  deviations  are  shown  in  Table 
11  in  section  6.  Thus  uncorrelated  noise  on  the  microphones  is  not  expected  to  limit  the 
accuracy  of  the  timing  estimate  in  the  experiments  carried  out  in  this  report.  This  was  not 
observed  in  practice.  Reasons  for  this  discrepancy  include  reverberation  and  the  fact  that 
the  noise  in  the  room  is  coming  mainly  from  a  localised  source  and  hence  is  not 
independent  on  each  microphone,  as  assumed  by  the  model. 
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5.  Implementation 

The  localisation  system  was  implemented  in  three  parts:  the  localisation  algorithms  which 
produces  the  source  position  estimate  based  on  the  relative  time-delays  between  receivers, 
the  time-delay  estimators  which  produce  the  relative  time-delays  for  the  localisation 
algorithms,  and  an  iterator  to  repeat  the  location  estimation  for  tracking  purposes. 

5.1  Localisation  Algorithms  and  Iterator 

The  SX  and  Ho-Chan  methods  were  implemented  in  MATLAB.  This  system  was  designed 
to  be  used  in  the  ICS  Laboratory  using  four  microphones  placed  along  the  walls  so  only 
the  SX  method  was  used.  It  takes  inputs  of  the  four  microphone  coordinates  and  the 
relative  time-delays  produced  by  the  time-delay  estimators  described  in  Section  5.2.  Once 
the  calculation  is  complete  the  algorithm  chooses  the  coordinates  based  on  the  roots  of  the 
equation  as  described  in  Section  4.1.1. 

The  iterator  runs  a  time-delay  estimator  for  each  receiver  pair  and  the  localisation 
algorithm  multiple  times  to  track  the  location  of  the  source.  Two  iterators  were  created, 
one  which  reads  four  Microsoft  Wave  computer  sound  files  as  inputs  for  testing  purposes, 
the  other  to  read  direct  from  the  ASIO  interface  of  the  audio  I/O  card  on  the  computer. 
The  localisation  algorithm  contains  only  small  matrix  operations  which  amount  to  less 
than  100  multiplications  per  iteration.  Most  of  the  computation  is  done  for  the  time-delay 
estimators  which  process  larger  amounts  of  information  as  described  in  Section  5.2. 

The  iterator  reads  directly  from  the  computer  I/O  card  using  a  set  of  MATLAB  'mex'  code 
files  called  pa_wavplay  (http:/ / sourceforge.net/ projects/ pa-wavplay/).  pa_wavplay  was 
developed  by  Johnson  Chen  and  Matt  Frear  based  on  a  free  open-source  C  library  called 
Port  Audio  (http:/ / www.portaudio.com/).  The  file  set  includes  functions  to  record  and 
playback  audio  using  the  available  I/O  interfaces  on  the  computer. 

5.2  Time-delay  Estimators 

Both  the  GCC-PHAT  and  ED  methods  were  implemented  in  MATLAB  using  Fast  Fourier 
Transforms  (FFTs).  Both  methods  take  in  two  arrays  of  data  and  the  relative  time-delay 
between  them  is  found.  They  are  used  similarly  to  the  'xcorr'  (cross-correlation)  function 
of  MATLAB. 

The  ED  method  includes  an  LMS  algorithm  that  was  based  on  the  Block  LMS  (Fast  LMS) 
algorithm  from  Flay  kin  (1991).  In  addition  to  the  algorithm  itself  a  simple  pre- whitening 
process  (similar  to  that  in  GCC-PHAT)  was  implemented  on  the  data  that  was  input  into 
algorithm. 

GCC-PHAT  is  much  faster  since  it  only  requires  12L  multiplication  operations  per 
estimate  while  ED  requires  (floor(L/N)-l)(38N)+12N  multiplications,  where  L  is  length  of 
the  input  signals,  N  is  length  of  vector  u,  "floor"  indicates  rounding  down  to  the  nearest 
integer.  For  1  second  worth  of  data  at  44100Hz  sampling  rate  and  length  of  u  set  at  1580, 
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GCC-PHAT  requires  529200  multiplications  while  ED  requires  1580000  multiplications. 
For  0.5  second  data  this  becomes  264600  and  739440  respectively. 

A  rough  estimate  of  ED  calculations  by  assuming  N  is  fixed  and  removing  the  -1  means  it 
has  about  38L+12N  operations,  or  roughly  3  times  the  number  of  operations  in  GCC- 
PHAT  since  N  is  much  less  than  L  in  most  cases.  Hence  based  on  computational 
requirements  GCC-PHAT  would  be  chosen  over  the  ED  method.  This  may  not  be  a 
problem  in  modern  computers  if  the  real-time  sampling  and  calculations  were  conducted 
in  parallel,  however  if  conducted  in  series  then  a  longer  calculation  time  would  reduce  the 
number  of  available  estimates  in  a  fixed  time  frame. 

The  ED  method  is  regraded  to  be  more  robust  to  reverberation  than  the  GCC-PHAT,  since 
instead  of  finding  the  maximum  peak  (which  is  how  correlation-based  methods  work),  it 
tries  to  estimate  the  impulse  response  of  the  room  which  includes  the  time-delay  peak  as 
well  as  reverberation  peaks.  From  this  estimate  it  finds  the  direct  path  peak  by  choosing 
the  first  main  peak  in  the  impulse  response. 


6.  Experiments 

Sound  recordings  were  done  in  the  ICS  Laboratory  using  the  four  existing  Shure  Microflex 
MX393 /  O  omnidirectional  microphones  in  the  room  and  a  Behringer  ADA8000  8-channel 
A-D/D-A  converter.  The  microphones  were  placed  on  the  walls  of  the  room  in  a 
configuration  shown  in  Figure  19  at  a  height  of  2.3  metres.  The  samples  were  recorded 
straight  into  a  PC  computer  by  connecting  the  A-D  converter  to  a  RME  HDSP  9652  audio 
I/O  card  via  fibre  optic  cable  using  ASIO  interface. 


6.32  m 


Figure  19.  Microphone  configuration  in  ICS  Laboratory. 
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The  recordings  were  of  a  Gaussian  random  white  noise  source  generated  from  MATLAB 
and  played  through  a  Sony  SRS-PC300D  speaker  with  a  frequency  response  of  50- 
20000Hz.  Different  sound  levels  were  recorded.  The  noise  source  was  then  filtered  to 
different  frequency  band  outputs  and  this  was  also  recorded.  The  output  and  input 
sampling  rates  were  both  44100Hz. 

Table  1  shows  all  the  recordings  and  their  Signal-to-Noise  Ratios  (SNRs).  The  sound  level 
was  detected  using  a  Castle  GA208  Sound  Level  Meter  held  2cm  from  the  sound  source 
(speaker).  "Overload"  represents  when  the  sound  meter  was  saturated.  It  means  the  sound 
level  was  too  high  for  the  meter  to  detect  properly,  and  for  most  cases  is  around  lOOdBA. 


Table  1.  Samples  recorded  from  the  ICS  Laboratory. 


Frequency  (Hz) 

Sound  levels  (dBA) 

SNRs  (dB) 

White 

(50-20000) 

Overload 

10.05 

90 

-0.3309 

80 

-10.33 

70 

-20.33 

50-500 

95 

10.37 

90 

8.110 

80 

-1.890 

70 

-11.89 

500-2000 

Overload 

7.736 

90 

-5.698 

80 

-15.70 

70 

-25.70 

2000-5000 

Overload 

9.044 

90 

-7.494 

80 

-17.49 

70 

-27.49 

5000-10000 

Overload 

12.75 

90 

2.969 

80 

-7.031 

70 

-17.03 

10000-15000 

Overload 

13.28 

90 

8.169 

80 

-1.831 

70 

-11.83 

15000-20000 

75 

0.7958 

70 

-4.204 
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7.  Results 


Table  2  shows  the  true  time-delays  for  each  microphone  pair  in  samples. 


Table  2.  True  time-delays  for  each  microphone  pair  in  samples  (sampling  rate  44100  Hz). 


Mic  1 

Mic  2 

Time-delay  (samples) 

1 

2 

13 

2 

3 

119 

3 

4 

-9 

1 

4 

123 

1 

3 

132 

2 

4 

110 

Table  3  and  Table  4  show  the  estimated  time-delays  from  the  three  estimation  methods. 
The  tables  show  that  GCC-PHAT  had  the  most  accurate  results  out  of  the  three  by  the 
largest  number  of  estimates  close  to  the  true  time-delays.  It  was  surprising  to  see  that  none 
of  the  time-delay  estimators  worked  well  in  the  frequency  band  50-2000Hz. 

Frequency  spectra  of  the  received  signals  were  created  to  determine  the  cause  of  the 
problems  in  the  50-2000Hz  band.  These  spectra  plots  are  shown  in  Figures  66-91  in 
Appendix  C.  The  most  notable  observation  from  the  plots  is  the  high  background  noise 
that  exists  in  the  0-1500Hz  area  as  shown  in  Figure  20.  This  is  the  most  likely  reason  that 
even  GCC-PHAT  was  unable  to  produce  accurate  time-delays  in  this  frequency  band. 
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Table  3.  Time-delays  between  pairs  of  the  four  microphones  estimated  using  three  methods:  cross¬ 
correlation  (xcorr),  GCC-PHAT  (gcorr)  and  ED  (eigen).  " max  length "  indicates  the 
maximum  length  of  the  sample  in  seconds.  All  data  used  were  sampled  at  44100  Hz. 
“n"  is  the  length  of  data  used  for  estimating  the  time  delays  in  samples.  The  time-delays 
are  expressed  in  samples.  The  shaded  time-delay  estimates  are  within  5  samples  of  the 
true  time-delays. 
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Table  4.  Estimated  time-delays  using  three  methods:  cross-correlation  (xcorr),  generalised  cross¬ 
correlation  (gcorr)  and  eigenvalue  decomposition  (eigen).  Continued  from  Table  3. 
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Power  Spectrum 


Figure  20.  Frequency  spectrum  of  recorded  background  noise  in  the  ICS  Laboratory.  The  red  oval 
indicates  the  dominant  low  frequencies. 

The  conclusion  from  the  analysis  so  far  is  that  the  GCC-PHAT  is  the  best  of  the  three  time- 
delay  estimators.  However  this  analysis  uses  large  amounts  of  data.  In  real-time 
localisation  the  system  can  only  collect  small  amounts  of  data  at  a  time  so  that  the  location 
estimate  can  be  constantly  and  quickly  updated. 

The  next  analysis  tested  the  estimators'  abilities  to  work  with  small  amounts  of  data.  The 
same  data  were  broken  into  small  parts  and  given  to  the  estimators  to  find  the  time-delay 
of  each  part.  The  white  noise  samples  of  all  four  sound  levels  (overload,  90dBA,  80dBA 
and  70dBA)  were  used  for  this. 

Initially  the  15  second  samples  were  broken  into  1  second  samples.  The  time-delay 
estimates  were  found  from  each  sample  and  plotted.  Figures  92-95  in  Appendix  D  show 
that  while  all  three  estimators  worked  well  for  'overload'  white  noise,  when  the  sound 
level  is  reduced  the  cross-correlation  performs  badly.  In  contrast  both  GCC-PHAT  and  ED 
methods  performed  very  well  having  at  most  only  2  outliers  in  a  plot.  The  samples  were 
then  broken  into  0.1  second  samples  mainly  to  find  whether  GCC-PHAT  or  ED  method  is 
better  when  less  information  is  available.  Cross-correlation  was  also  tested  as  a  reference. 
These  plots  are  shown  in  Figures  96-99  in  Appendix  D. 
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Table  5  and  Table  6  summarise  the  number  of  outliers  in  each  set  of  estimates. 


Table  5.  Number  of  outliers  for  estimates  ofl  second  data  length  for  cross-correlation  (xcorr),  GCC- 
PHAT  (gcorr)  and  ED  method  (eigen)  shown  for  each  microphone  pair. 


Estimates  with  1  second  length  data,  14  estimates  per  set 
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0 

Table  6.  Number  of  outliers  for  estimates  of  0.1  second  data  length  for  cross-correlation  (xcorr), 
GCC-PHAT  (gcorr)  and  ED  method  (eigen)  shown  for  each  microphone  pair. 


Estimates  with  0.1  second  length  data,  147  estimates  per  set 

Microphone  pairs  — > 
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0 

33 
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Table  5  and  Table  6  clearly  show  that  the  ordinary  cross-correlation  works  only  for 
'overload'  sound  levels,  for  lower  signal  power  it  is  unable  to  provide  correct  estimates  for 
the  majority  of  the  time.  GCC-PHAT  and  ED  methods  worked  much  better  and  only  to 
show  small  numbers  of  outliers  in  the  70  dBA  and  80  dBA  levels.  Overall  GCC-PHAT  has 
less  outliers  than  the  ED  method. 

One  observation  from  the  plots  in  Figures  92-96  is  that  there  is  a  repeating  striation  effect 
in  the  cross-correlation  plots,  more  notably  in  Figures  97-99.  This  effect  is  probably  caused 
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by  correlations  in  the  background  noise  in  the  room,  but  could  also  be  due  to 
reverberation.  This  effect  is  reduced  in  GCC-PHAT  and  ED  plots  possibly  due  to  the  pre¬ 
whitening  step,  which  effectively  flattens  the  background  noise  spectrum,  thus  reducing 
the  relative  background  noise  power.  The  pre-whitening  filter  also  reduces  the  side-lobe 
levels  in  the  auto-correlation  function  of  both  the  desired  and  background  noise  signals. 

Tables  7-10  show  some  statistical  information  obtained  from  the  analysis  and  includes  the 
mean,  median  and  variance  of  estimates.  The  column  headings  'diffmean'  and 
'diff median'  show  the  mean  and  median  after  the  true  values  have  been  subtracted  so  that 
deviations  can  be  seen  without  any  bias. 

Tables  7  and  8  show  information  for  the  time-delay  estimates  of  1  second  and  0.1  second 
length  data  respectively. 


Table  7.  Statistical  information  for  time-delay  estimates  ofl  second  length  data.  Values  are  shown 
in  samples,  sample  rate  44100Hz.  'x'  represents  cross-correlation,  ‘g  represents  GCC- 
PHAT  and  'e'  represents  ED  method.  The  numbers  next  to  the  letters  indicate 
microphone  pairs,  e.g.  xl2  means  the  estimates  for  microphone  pair  1  and  2  using  cross¬ 
correlation. 
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Table  8.  Statistical  information  for  time-delay  estimates  of  0.1  second  length  data.  Values  are  shown 
in  samples,  sample  rate  44100Hz.  'x'  represents  cross-correlation,  ‘g  represents  GCC- 
PHAT  and  'e'  represents  ED  method.  The  numbers  next  to  the  letters  indicate 
microphone  pairs,  e.g.  xl2  means  the  estimates  for  microphone  pair  1  and  2  using  cross¬ 
correlation. 
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0.0884 

0 

x13 

-1.0544 

87 

1286800 

-133.0544 

-45 

gi3 

124.6803 

133 

1020.9 

-7.3197 

1 

e13 

129.0884 

133 

3640.4 

-2.9116 

1 

x24 

55.8639 

505 

927260 

-54.1361 

395 

g24 

110 

110 

0 

0 

0 

e24 

110 

110 

0 

0 

0 

The  true  coordinates  of  the  source  based  on  the  true  time-delays  is  (4.1579,  1.7109). 
Localisation  estimates  were  done  from  these  time-delay  estimates  using  the  SX  method. 
Table  9  and  Table  10  show  statistical  information  for  the  localisation  estimates  using  the 
time-delay  estimates  for  1  second  data  and  0.1  second  data  respectively,  'diffmean'  and 
'diffmedian'  represent  the  same  as  that  for  Tables  7  and  8  where  the  true  values  were 
subtracted  from  the  mean  and  median  to  remove  bias. 
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Table  9.  Statistical  information  for  localisation  coordinate  estimates  using  time-delay  estimates  ofl 
second  length  data.  Coordinates  x  and  y  are  shown  in  metres,  'xcorr',  'gcorr'  and  'eigen' 
represent  cross-correlation,  GCC-PHAT  and  ED  methods  respectively. 


xcorr 

gcorr 

eigen 

white  overload 

variance 

diffmean 

diffmedian 

mean 

mean 

median 

diffmedian 

3.95E-07 

0.0002 

0 

X 

4.1589 

4.1596 

7.91  E-07 

0.001 

0.0017 

X 

4.1589 

4.1596 

7.91  E-07 

0.001 

0.0017 

H 

1.41E-06 

0.0005 

0 

V 

1.7128 

1.7142 

2.82E-06 

0.0019 

0.0033 

V 

1.7128 

1.7142 

2.82E-06 

0.0019 

0.0033 

white  90dBA 

mean 

median 

variance 

EHEB1 

mean 

median 

variance 

mean 

median 

variance 

diffmedian 

HgM 

8.4034 

-3.7226 

X 

4.1604 

4.1597 

3.78E-06 

0.0025 

0.0018 

X 

4.1604 

4.1597 

3.78E-06 

0.0025 

0.0018 

n 

29.1892 

2.2603 

y 

1.7092 

1.709 

8.45E-07 

-0.0017 

-0.0019 

y 

1.7092 

1.709 

8.45E-07 

-0.0017 

-0.0019 

white  80dBA 

mean 

median 

variance 

diffmean 

diffmedian 

mean 

median 

□ 

mean 

median 

diffmedian 

X 

18.018 

-13.7524 

9704.7 

13.8601 

-17.9103 

X 

4.1405 

4.1597 

0.006 

-0.0174 

0.0018 

4.1597 

0.1782 

0.092 

0.0018 

V 

90.1342 

34.3648 

20002 

88.4233 

32.6539 

y 

1.7019 

1.709 

8.00E-04 

-0.009 

-0.0019 

n 

1.709 

0.017 

0.0246 

-0.0019 

white  70dBA 

mean 

median 

variance 

diffmean 

diffmedian 

mean 

median 

EBilirailETil 

mean 

median 

diffmedian 

X 

10.343 

2.8973 

4365.6 

6.1851 

-1.2606 

X 

4.1596 

4.1597 

2.46E-06 

0.0017 

0.0018 

X 

4.1598 

4.1597 

2.21  E-06 

0.0019 

0.0018 

y 

51.4181 

20.6387 

9837.9 

49.7072 

18.9278 

y 

1.7095 

1.709 

7.83E-07 

-0.0014 

-0.0019 

y 

1.7094 

1.709 

6.31  E-07 

-0.0015 

-0.0019 

Table  10.  Statistical  information  for  localisation  coordinate  estimates  using  time-delay  estimates  of 
0.1  second  length  data.  Coordinates  x  and  y  are  shown  in  metres,  'xcorr',  'gcorr'  and 
'eigen'  represent  cross-correlation,  GCC-PHAT  and  ED  methods  respectively. 


xcorr 

gcorr 

eigen 

white  overload 

mean 

median 

variance 

mean 

median 

variance 

e mm 

—I 

mean 

median 

diffmedian 

4.1242E-07 

0.0003 

0 

4.1596 

6.96E-07 

0.0011 

4.159 

4.1596 

6.84E-07 

0.0011 

0.0017 

n 

1.4712E-06 

0.0006 

0 

fl 

1.7142 

2.48E-06 

0.0021 

n 

1.7131 

1.7142 

2.44E-06 

0.0022 

0.0033 

white  90dBA 

mean 

median 

variance 

BllilireilETil 

mean 

median 

variance 

mm 

EBiffigilEia 

mean 

median 

diffmedian 

40.7387 

-3.3762 

-1.6938 

X 

4.1582 

4.1597 

5.70E-04 

0.0003 

0.0018 

X 

4.1358 

4.1597 

0.0064 

-0.0221 

0.0018 

B 

16.7858 

1.0937 

1.096 

y 

1.7087 

1.709 

7.69E-05 

-0.0022 

-0.0019 

V 

1.7019 

1.709 

0.001 

-0.009 

-0.0019 

white  80dBA 

mean 

EES1 

variance 

mean 

IKS 

mean 

diffmedian 

X 

-1.9362 

1.3293 

293.0539 

-6.0941 

-2.8286 

X 

4.1529 

4.1597 

0.0022 

-0.005 

0.0018 

4.1244 

4.1597 

0.0089 

-0.0335 

0.0018 

y 

5.1116 

4.0725 

170.3808 

3.4007 

2.3616 

y 

1.7066 

1.709 

0.000299 

-0.0043 

-0.0019 

n 

1.7006 

1.709 

0.0017 

-0.0103 

-0.0019 

white  70dBA 

mean 

median 

variance 

mean 

median 

variance 

mm 

mean 

median 

diffmedian 

X 

-2.437 

0.4154 

249.4983 

-6.5949 

-3.7425 

X 

4.1544 

4.1597 

0.0073 

-0.0035 

0.0018 

X 

4.0795 

4.1597 

0.3874 

-0.0784 

0.0018 

y 

4.6517 

3.6709 

77.3434 

2.9408 

1.96 

y 

1.7491 

1.709 

0.0135 

0.0382 

-0.0019 

y 

1.6864 

1.709 

0.1324 

-0.0245 

-0.0019 

Tables  7  and  8  show  that  the  ordinary  cross-correlation  has  mean  and  median  values  with 
high  deviation  from  the  true  values  and  very  high  variance  for  all  data  other  than  that  of 
'overload'  noise  level.  GCC-PHAT  and  ED  methods  held  up  well  with  medians  differing 
by  1  sample  at  the  most  from  the  true  value;  however  the  outliers  created  mean  and 
variance  values  that  detracted  away  from  their  overall  effectiveness  shown  in  the  plots  in 
Figures  92-99.  The  coordinate  estimates  in  Tables  9  and  10  reflect  what  was  shown  in 
Tables  7  and  8  with  GCC-PHAT  and  ED  methods  producing  good  estimates  the  majority 
of  the  time  while  the  cross-correlation  only  does  so  for  'overload'  sound  level  data.  Overall 
GCC-PHAT  had  the  best  statistical  results. 
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The  theoretical  standard  deviations  were  calculated  for  1  second  and  0.1  second  estimates 
using  equation  (35)  and  are  shown  in  Table  11  in  units  of  samples.  Note  that  the  theoretical 
standard  deviation  is  significantly  less  than  that  measured  in  practice,  for  example  for  the 
70  dBA  power  it  is  0.17  samples2  which  corresponds  to  7.6E-6  m2  at  a  44  KHz  sample  rate, 
which  is  significantly  less  than  0.013  m2  in  Table  10.  Overall  the  theoretical  standard 
deviation  remains  much  less  than  one  sample  and  hence  should  not  affect  the  accuracy  of 
the  measurement.  The  increase  in  variance  in  the  practical  system  is  thought  to  be  due  to 
reverberation  and  the  fact  that  the  background  noise  was  generated  by  a  localised  sound 
source  and  hence  was  not  completely  un-correlated  between  microphones. 


Table  11.  Theoretical  standard  deviations  for  1  second  and  0.1  second  TDOA  estimates  in  samples. 


Sound  level  of  sample 

Standard  deviations 
(samples)  1  sec 

Standard  deviations 
(samples)  0.1  sec 

Overload 

1.1675e-003 

3.6919e-003 

90  dBA 

3.8746e-003 

1.2253e-002 

80  dBA 

1.2741e-002 

4.0289e-002 

70  dBA 

5.3322e-002 

0.1686 

8.  Conclusions 


Two  speaker  localisation  algorithms,  the  Spherical  Intersection  (SX)  method  and  the  Ho- 
Chan  method,  were  implemented  in  MATLAB.  Simulations  were  done  to  test  their 
effectiveness  to  locate  a  source  in  a  virtual  room.  Both  methods  were  successful  in  locating 
a  source  anywhere  in  the  room  provided  that  the  time-delay  estimates  were  accurate, 
although  the  Ho-Chan  method  has  smaller  error  variances. 

Three  time-delay  estimators  were  compared:  the  ordinary  cross-correlation,  the 
generalised  cross-correlation  using  phase  transform  (GCC-PHAT),  and  eigenvalue 
decomposition  (ED)  method.  GCC-PHAT  and  ED  were  implemented  in  MATLAB.  From 
the  experiments  with  the  three  time-delay  estimators,  it  is  clear  that  the  ordinary  cross¬ 
correlation  is  not  able  to  work  well  in  environments  like  the  ICS  Laboratory  where  there  is 
reverberation  and  significant  background  noise.  The  GCC-PHAT  method,  although  only 
an  extension  to  the  ordinary  cross-correlation,  works  well  in  the  face  of  background  noise 
and  is  good  for  real-time  applications  due  to  its  low  calculation  time.  The  ED  method  is  a 
strong  competitor  to  the  GCC-PHAT  due  to  its  similar  performance.  However  it  has  a 
higher  computation  requirement  and  would  be  slower  than  GCC-PHAT.  Hence  GCC- 
PHAT  would  be  better  suited  for  real-time  use. 
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9.  Future  Developments 

The  system  in  its  current  state  is  incomplete.  The  objective  of  the  research  is  to  eventually 
have  a  system  that  will  detect  the  location  of  a  person  speaking  in  real-time.  There  are 
many  areas  that  can  be  improved  with  further  research  and  development. 

The  Ho-Chan  method  is  good  provided  that  the  time-delays  are  accurate.  Currently  the 
algorithm  is  implemented  for  two-dimensional  localisation.  This  can  produce  errors  since 
the  speaker  sources  and  the  microphone  receivers  do  not  always  exist  in  the  same  2D 
plane  due  to  height  differences  (such  as  the  difference  between  a  sitting  speaker  and  a 
standing  speaker).  An  implementation  for  three  dimensions  would  therefore  provide  more 
accurate  estimates. 

The  ED  method  was  originally  designed  to  be  an  estimator  that  is  robust  to  reverberation 
(Benesty  2000).  ED  is  a  relatively  new  method  compared  with  GCC-PHAT  and  hence  it  is 
likely  to  be  improved  with  further  development.  Research  can  be  done  to  improve  its 
robustness  to  both  background  noise  and  reverberation  while  being  able  to  work  with 
smaller  amounts  of  data  to  increase  speed. 

Further  improvement  of  time-delay  estimation  can  be  accomplished  using  noise  reduction 
or  noise  cancelling  techniques.  These  can  remove  or  lessen  much  of  the  background  noise 
that  has  an  adverse  effect  on  the  time-delay  estimators. 

A  Kalman  filter  tries  to  minimise  the  mean  squared  error  to  estimate  the  state  of  a  process, 
and  supports  estimation  of  past,  present  and  future  states  (Welch  and  Bishop  1995).  The 
implementation  of  a  Kalman  filter  in  the  system  should  be  considered  in  the  future  as  it 
would  improve  the  tracking  accuracy.  More  advanced  tracking  algorithms  such  as  Particle 
Filters  should  also  be  considered. 

The  system  currently  tries  to  localise  based  on  any  signals  it  receives.  Voice  detection 
methods  can  be  implemented  to  activate  the  localisation  only  when  people  are  speaking 
rather  than  for  random  noises  and  sounds. 

It  will  also  be  required  to  extend  the  localisation  algorithm  for  multiple  simultaneous 
speakers. 
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Appendix  A:  Derivation  of  Algorithms 


A.l.  Derivation  of  Spherical  Intersection  (SX)  method 

The  SX  method  for  localisation  uses  matrices  to  find  the  position  of  a  source  based  on 
time-delays  between  source  and  receivers.  This  method  for  localisation  uses  the  notion  of  a 
reference  receiver  so  that  time-delays  can  be  compared  to  each  other. 

Let  the  sound  source  be  at  the  unknown  position  (x,  y)  and  the  N  sensors  to  be  at  the 
known  coordinates  (x„  y)  for  i  =  1  to  N. 

The  squared  distance  between  the  source  and  receiver  i  is  given  by 

ri  =(x,  -  x )2  +{yi-y)2 

=  x2  +  y2  -  2x,x  -  2y,.y  +  x2  +  yf  (31) 


Let 


Then 

r2  =  x2  +  y 2  -  2xtx  -  2 y.y  +  Kt  .  (32) 

Without  any  loss  of  generality,  let  sensor  1  be  the  reference  sensor.  For  the  case  when  i  =  1 

r2  -  x2  +  y 2  -  2xjX  -  2 yxy  +  Kx  (33) 

Let  di, i  be  the  TDOA  between  sensor  i  and  sensor  1  (the  reference  receiver)  c  be  the  speed  of 
propagation  of  the  signal  and  n, i  be  the  TDOA  distance.  Then 

L.i  =  cd,  x  =  r(  -  r,  (34) 

Rearranging  (34)  and  squaring 

K  =irL  1  +q)2 

=  rlx  +  2ri, lb  +  r\  ■  (35) 

Equating  (32)  and  (35) 

r2x  +  2ri  xrx  +  r2  =  x2  +  y 2  -  2xtx  -  2 yiy  +  Kt  (36) 
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Subtract  (33)  from  (36) 

rl\  +  2ri,iri  =  ~2xulx  -  2 yi  Xy  +  Kt  -  Kl 


(37) 


where 


Xi, i  =  xi  -  xi,  yi,  i  =  y;  -  y i. 


If  this  is  done  for  each  i  then  a  matrix  equation  can  be  found  for  (x,  y) 

x 


2X 


y 


2  An  +  B 


(38) 


where 


1 

K> 

y  2,i 

1 

_ 1 

PL 

_ 1 

1 

to 

-x 

*3,1 

T3.1 

,  A  =  - 

r3.l 

,  B  =  - 

r2 

A3,l 

+ 

x 

1 

1 

rN,  1  _ 

1 

_ 1 

Kn 

-&x_ 

Pre-multiplying  both  sides  by  XT  and  dividing  by  2,  where  X 1  =  transpose  of  X 


XTX 


X 

(  1  ^ 

=  xT 

dq  +-B 

_y_ 

l  2  J 

(39) 


Pre-multiplying  both  sides  by  the  inverse  of  XT X  gives  a  set  of  equations  in  the  form  of 


=  Cq  +  D 


(40) 


where 

c  = 


c. 


=  (xT  x)~l  XT  A 


\DX  1 

rn 

,  D  = 

DA 

=  (xTx)  XT 

I2J 

B. 


Note  that  the  above  equation  can  only  be  solved  if  the  inverse  of  X1  X  exists.  This 
requires  X  to  have  more  than  one  row  (which  corresponds  to  having  three  or  more 
microphones)  and  the  rows  of  X  to  be  linearly  independent  (which  corresponds  to  the 
microphones  not  being  collinear).  Substituting  (40)  into  (33)  gives  an  equation  in  the  form 
of 


ar2  +  /?q  +  x  =  0 


(41) 
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where 


a  =  C;+C;-\, 

P  =  2(Cx(D,-xl)+Cy{Dy-yt)), 

X  =  Dy(Dy-2xl)+Dy{Dy-2y,)+Kr 


Solving  the  quadratic  (41)  gives  values  for  r\  which  can  then  be  substituted  into  (40)  to  give 
an  estimate  for  (x,  y).  The  correct  value  to  choose  from  the  quadratic  roots  is  discussed  in 
Section  4.1.1. 


A.2.  Derivation  of  Ho-Chan  Method 

The  Ho-Chan  Methods  starts  with  the  same  equations  as  the  SX  method  and  arrives  at  the 
same  result  as  (38)  which  is  repeated  here: 


2X\ 


x 


y. 


=  2  Ar\  +  B 


where 


*2,1 

y  2,i 

C,1 

2C 

_ 1 

1 

to 

-k; 

x  = 

*3,1 

y  3,i 

,  A  =  - 

73,1 

,  B  =  - 

r 2 

3,1 

+ 

K, 

XN,  1 

i 

/V,iJ 

_ 1 

1 

-*i_ 

Arranging  this  equation: 


X 


x 


Lyj 


Ar=-B 

2 


Merging  the  left  side  of  (42)  and  re-arranging  becomes 


(38) 


(42) 


1 

2 


B  +  [X 


h 


0 


Based  on  (43)  define 
h  -  GaZa°  =  0 
where 


(43) 


(44) 
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~rl-K2+Kx~, 

1 

X 

to 

V 

to 

_ 1 

2  2 

r3,i  ~K3  +Kl 

,  Ga=  -[x  -a] 

*3,1  y  3,1  r3,l 

J~N,  1  _  Kn  +  Kx  J 

_XN,  1  Tv,l  rN,\_ 

o 

J\  _ 

x°  denotes  the  true  value  of  x.  (44)  only  holds  when  h  and  Ga  contain  the  true  TDOA 
distances  n,  1  for  i  =  2  to  N.  In  practice  the  true  TDOA  distances  are  not  available.  The  error 
vector  ip  is  thus  defined  as 

h  -  GaZa°  =  ip  (45) 

where 


¥ 2 
¥  3 

lp=  .  ■ 

¥n_ 

The  TDOA  distances  are  given  as 

ri,\  =  r/°i  +  nu  /  i  =  2toN  (46) 

where  r.\  are  the  true  TDOA  distances  and  nj ,  are  the  errors  in  the  TDOA  distances. 


From  the  definition  of  TDOA  distances 


.o  o 


(47) 


Substituting  (46)  and  (47)  into  (45) 


Vi  =  ^(t.i2  ~Ki  +  Kx )+V0+h,i/+''i/i0 

1  I  0  2  ,  O  0  ,  2  jr  jr  |  0,  0,00,0 

= -\ri,  1  +  2r,.  1«.  1  +  n.  1  +yyp  +rur1  niX. 
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Recall  from  (44)  that  with  true  TDOA  distances 


n  Wo2  7^,7^!,  0,  0,00 

o  =  -  vh  -  Ki  +  Ki )+  xi,ix  +  y^y  +  ri/\ 


hence 


V>i=  +ni,i2)+ri°ni,i 


=  |(2*>,,i  ~2rxni,i  +ni,l2)+rl°ni,l 


0  1  2 
=  ri  ni,i  +  2ni,i  ■ 


Therefore 


ip  =  Bn  +  —  nOn 

2 


where 


1 

0 

...  o" 

*2,1 

B  = 

0 

r° 

a3 

0 

7  n  = 

*3,1 

0 

1 

o  > 

C1 

o 

UN, !_ 

(48) 


(49) 


n  is  the  error  vector  of  the  TDOA  distances;  the  symbol  O  represents  the  Schur  product 
(element-by-element  product). 

The  covariance  matrix  W  of  ip  is  evaluated  with  the  condition  that  the  signal-to-noise  ratio 
(SNR)  is  high,  i.e.  m,\  «  r,°.  The  second  term  on  the  right  of  (49)  is  ignored  and  the 
covariance  matrix  is  given  by 


=  E [ip  ipT]  =  BQB 


(50) 


where  E[x]  is  the  expected  value  of  x  and  Q  is  the  covariance  matrix  of  n. 

The  set  of  equations  in  (45)  are  solved  by  LS  with  the  assumption  that  the  elements  in  za 
have  no  relationship: 

Za  =  (GaTW  lGa)  lGaTV  %  (51) 
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In  practice  V  is  not  known  since  B  contains  the  true  TDOA  distances.  The  Ho-Chan 
method  uses  further  approximation  to  estimate  B.  When  the  source  is  far  from  the  sensor 
array,  each  r,°  is  close  to  r°  so  that  B  ~  r°I,  where  r°  is  the  range  of  the  source  and  I  is  the 
identity  matrix  of  size  N- 1.  Scaling  of  W  does  not  affect  the  answer,  so  an  approximation  of 
(51)  is 

Za  -  (GaTQ'1Ga)_1GaTQ_1h  (52) 

For  the  case  where  the  source  is  close  to  the  sensor  array  (52)  is  used  to  obtain  an  initial 
estimate  of  za  which  is  used  to  find  an  estimate  of  B.  This  is  then  put  into  (51).  (51)  can  be 
iterated  multiple  times  to  produce  an  even  better  estimate,  but  simulations  done  by  Chan 
and  Ho  (1994)  show  that  one  iteration  is  sufficient  to  produce  accurate  results. 

The  covariance  of  za  is  given  by 

COv(Za)  =  E[AZaAZar]  =  (Ga^^Ga0)'1  (53) 

Where  Ga°  is  identical  to  Ga,  as  defined  in  (44),  except  that  the  TDOA  values,  >'2,1  r%i  ..  r\,i 
are  exact,  ie. 

x2,i  y 2,1  r2,i 

X3,l  T3,l  r3,l 
XN,  1  Tat,  1  rN,  1_ 

The  above  solution  of  za  assumes  that  x,  y  and  n  are  independent.  However  they  are 
related  by 

rx  =  (xi  -xf  +(Ti  -T)2  (54) 

Let  the  elements  of  za  be  expressed  as 


Z«,l 

~x° 

Za  = 

Z«,2 

= 

y° 

+ 

_Zaf,3  \ 

r° 

M 

+  6^ 

(55) 


where  e\,  ei  and  63  are  estimation  errors  of  za. 

Define  another  set  of  equations: 

h'  -  Ga'Za'O  =  ip’  (56) 


where 
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(Za,  1  +^l)2 

“l 

0" 

h’  = 

(Za,2  +yiY 

,Ga'  = 

0 

1 

2 

1 

1 

{x-xj 

(y-y<Y_ 


w'  = 


Vi 

¥2 

¥3'. 


is  the  vector  of  inaccuracies  in  za.  Substituting  (56)  into  (55)  gives 


tpi  =  2(x°  -  xi)ei  +  ei2 
ipi  =  2 (y°  -  yi)ei  +  ei2 
ip?,'  =  2ri%3  +  C32. 

For  small  errors  (57)  can  be  approximated  by 


ipi  «  2(x°  -  xi)ei 
xpi  «  2(y°  -  i/i)e2 
1^3' «  2n°e3. 


The  covariance  matrix  W'  of  ip'  is 
W  =  E[t p'lp'T]  =  4B'cov(za)B' 


where 


(57) 


(58) 


(59) 


B'  can  be  approximated  by  using  the  values  in  za.  A  second  LS  calculation  is  used  to  find 

Za' 


Za'  =  (Ga'^P'-lGa'^Ga'^F'-lh'  (60) 

The  Covariance  matrix  of  za'  is 

COv(Za')  =  (Ga'^Ga')-1  (61) 

Since  za'  is  an  estimate  of  ( x  -  Xi)2  and  (y  -  yi)2  a  simple  conversion  will  produce  x  and  y 
estimates: 


Zn  —  a/Z 


*1 

Ti 


(62) 
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Since  there  are  two  possible  solutions  for  a  square  root,  the  correct  solution  is  the  one  that 
lies  in  the  region  of  interest.  This  can  be  easily  determined  by  looking  at  the  sign  of  the 
initial  za  estimates.  If  one  of  the  coordinates  is  close  to  zero  the  square  root  may  become 
imaginary.  In  this  case  the  imaginary  component  should  be  set  to  zero  (Chan  and  Ho 
1994). 

The  covariance  matrix  O  of  zp  is 


<t>  =  cov(zp)  = 


^  B"-icov(za')B"-1 


where 


B"  = 


0 

(t°-Ti). 


(63) 


To  summarise,  (52)  is  used  to  find  an  initial  estimate  of  za  which  is  used  to  estimate  B.  (51) 
is  then  used  to  find  more  accurate  estimates  of  za  and  can  be  repeated  for  multiple 
iterations.  (60)  is  then  used  to  put  the  relationship  between  x,  y  and  n  into  the  calculation. 
The  final  coordinate  estimate  zp  is  found  using  (62). 


A.3.  Derivation  of  Cramer-Rao  Lower  Bound  (CRLB)  for  Localisation 

The  CRLB  for  localisation  is  based  on  Gromov,  Akos,  Pullen,  Enge  and  Parkinson  (2000). 
Gromov  et  al.  (2000)  give  a  CRLB  calculation  for  3D  localisation  that  uses  radius  distance, 
elevation  angle  and  azimuth  angle  errors  of  estimated  source  coordinates  from  the 
reference  (0,  0,  0)  point.  The  CRLB  Error  Covariance  Matrix  (ECM)  is  defined  by 

B,„={htB^hY  (64) 

where 


H  = 


8AR, 

8AR, 

8ARi 

8R 

8/3 

8s 

8AR2 

8AR2 

8AR2 

8R 

8(3 

8s 

SAR-n-i 

dARN- 1 

dARN- 1 

8R 

8(3 

8s 

,  R  =  radius,  /l  =  azimuth,  e  =  elevation. 


B2R  =  ECM  of  the  TDOA  distances. 
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A Ri  are  the  TDOA  distances  for  i  =  1  to  N-l,  N  is  the  number  of  sensors.  This  was  derived 
using  cosine  and  sine  rules  of  triangles: 


where  h,  /?,  and  e,  are  the  radius  distance,  azimuth  and  elevation  angles  of  sensor  i  for  i  =  1 
to  N. 

Using  the  same  principle  the  CRLB  of  2D  Cartesian  coordinate  localisation  was  derived. 
Taking  the  square  root  of  (31)  and  (54): 

r,  =V(X/  ~xY  +t Vi  -yf 

ri  =  V(xi  ~xf  +Cu  -yf 


the  TDOA  distances  are  given  by 


the  CRLB  ECM  is  thus 

ECM  =  (htB;^hY  (67) 

where  H  is 
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dC,l 

dr2,X 

dx 

dy 

drXX 

dx 

dy 

drN,l 

drN,X 

dx 

dy 

The  ECM  is  a  2x2  matrix  with  the  diagonals  containing  lower  bound  variances  for  x  and  y 
coordinates  in  the  first  and  second  diagonal  elements  respectively.  Summing  of  these  will 
result  in  the  lower  bound  variance  of  the  distance  used  in  the  CRLB  plots  in  Section  4.1 
and  Appendix  B. 

The  CRLB  for  3D  localisation  is  a  straight  extension  of  the  2D  case: 


ri  = 

\l' -  x f  +  ( v,.  -  yf  +  (z,  -  zf 

r\  = 

V(x  i  ~xf  +Cvi  -  yf  +izi  -zf 

C  i  = 

V(Xf  “  xf  + G v,  -  v)2  +  (z,.  -  zf  - 

f(x,  -xf  +{y !  -yf  +(zl-zf  , 

Xj  - X 

X,  -x 

dx 

yl(xi  —  xf  +  (y  ,•  -yf  +(zi-zf 

V(*i  —  xf  +  (y i  -  yf  +  -zf 

yt-y 

y,  -y 

dy 

>/(*>  -  xf  +  (v,.  -  yf  +  (zt  -  zf 

Vfo  -x)2  +(Ti  _t)2  +  (zi  -*)2 

Zi-Z 

Zj  -z 

dz 

V( xi  -  xf  +  ( y i  -  v)2  +  (z,  -  zf 

Vfe  ~x)2  +(ti  ~t)2  +  (zi  -z)2 

The  CRLB  is 

ECM  =  (htB;\H (70) 


where 
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dr2,l 

dr2,l 

dr2,l 

dx 

dy 

dz 

dr2,l 

dr2.X 

dr2,l 

dx 

dy 

dz 

drN,l 

SrN,l 

dr 

urN,  1 

dx 

dy 

dz 

The  resulting  ECM  is  a  3x3  matrix  with  the  diagonals  containing  the  lower  bound  variance 
of  x,  y  and  z  coordinates.  As  for  the  2D  CRLB,  summing  of  these  values  results  in  the 
distance  lower  bound  variance. 
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Appendix  B:  Standard  Deviation  Plots 

These  standard  deviation  plots  were  created  using  TDOA  distances  with  additive  noise  of 
set  variances.  There  are  four  sets  of  plots  which  correspond  to  the  two  room  sizes  and  the 
two  noise  variances  used.  Note  that  the  colour  bar  scale  on  the  side  of  each  figure  changes 
scale  depending  on  the  range  of  the  values.  Table  12  shows  the  different  sets  of  plots. 

Table  12.  Sets  of  standard  deviation  plots. 


Set  number 

Room  size 

Noise  variance 

1 

4x4  m 

0.01  m2 

2 

4x4  m 

0.0001  m2 

3 

8x4  m 

0.01  m2 

4 

8x4  m 

0.0001  m2 

Figures  21-30  show  standard  deviation  plots  for  Set  1. 


CRLB  of  Source  Position  Error  (ml 


1.5  2  2.5  3  3.5 

X-co-ordinate  (m) 


Figure  21.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  22.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  23.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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SX  -  CRLB  Source  Position  Error  (mi 
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Figure  24.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  25.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  26.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 


Figure  27.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  28.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  29.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Ho  Chan  -  CRLB  Source  Position  Error 
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Figure  30.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 


Figures  31-40  show  the  standard  deviation  plots  of  Set  2. 

CRLB  of  Source  Position  Error  ( m ) 

0.1 

0.09 

0.08 

0.07 

0.06 

0.05 

0.04 

0.03 

0.02 

0.01 

0 

0  0.5  1  1.5  2  2.5  3  3.5  4 

X-co-ordinate  (m) 


Figure  31.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  32.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m1;  reference  sensor  on  the  side. 
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Figure  33.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  34.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 


Ho  Chan  -  CRLB  Source  Position  Error  (m) 

0.01 

0.009 

0.008 

0.007 

0.006 

0.005 

0.004 

0.003 

0.002 

0.001 

0 

0  0.5  1  1.5  2  2.5  3  3.5  4 

X-co-ordinate  (m) 


Figure  35.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  36.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  37.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  38.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples; 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  39.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  40.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  4x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 


Figures  41-50  show  the  standard  deviation  plots  of  Set  3. 
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Figure  41.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  42.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  43.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples ;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  44.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  45.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  on  the  side. 
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Figure  46.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  47.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  48.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  49.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 
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Figure  50.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.01  m2;  reference  sensor  in  the  middle. 


Figures  51-60  show  the  standard  deviation  plots  of  Set  4. 
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Figure  51.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  52.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  53.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  54.  Plot  of  CRLB  subtracted  from  SX  standard  deviations; 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  55.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  on  the  side. 
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Figure  56.  Standard  deviation  plot  of  CRLB  of  TDOA  distance  error;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  57.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  58.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  59.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 
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Figure  60.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  8x4  m ;  noise 
variance  =  0.0001  m2;  reference  sensor  in  the  middle. 


Figures  61-65  are  have  the  same  room  size  and  noise  variance  as  Set  4.  The  difference  is 
that  the  microphones  are  randomly  placed  along  the  walls. 
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Figure  61.  Standard  deviation  plot  of  CRLB  ofTDOA  distance  error;  room  size 
variance  =  0.0001  m2;  random  microphone  array. 
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Figure  62.  Standard  deviation  plot  of  SX  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  random  microphone  array. 
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Figure  63.  Standard  deviation  plot  ofHo-Chan  method  over  500  samples;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  random  microphone  array. 
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Figure  64.  Plot  of  CRLB  subtracted  from  SX  standard  deviations;  room  size  =  8x4  m ;  noise 
variance  =  0.0001  m2;  random  microphone  array. 
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Figure  65.  Plot  of  CRLB  subtracted  from  Ho-Chan  standard  deviations;  room  size  =  8x4  m;  noise 
variance  =  0.0001  m2;  random  microphone  array. 
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Appendix  C:  Spectral  Plots  of  Experiment  Recordings 
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Figure  66.  White  noise,  overload. 
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Figure  68.  White  noise,  80dBA. 


74 


|X(fr  (dB)  c*  |X(f)r  (dB) 


DSTO-TR-2126 


Power  Spectrum 


White  noise,  70dBA. 


Power  Spectrum 


Figure  70.  50-500Hz  noise,  95dBA. 
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Figure  71.  50-500Hz  noise,  90dBA. 
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Figure  72.  50-500Hz  noise,  80dBA. 
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Figure  73.  50-500Hz  noise,  70dBA. 
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Figure  74.  500-2000Hz  noise,  overload. 
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Figure  75.  500-2000Hz  noise,  90dBA. 
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Figure  76.  500-2000Hz  noise,  80dBA. 


78 


DSTO-TR-2126 


Power  Spectrum 


Figure  77.  500-2000Hz  noise,  70dBA. 
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Figure  78.  2000-5000Hz  noise,  overload. 
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Figure  79.  2000-5000Hz  noise,  90ABA. 
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Figure  80.  2000-5000Hz  noise,  80dBA. 
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Figure  82.  5000-10000Hz  noise,  overload. 
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Figure  84.  5000-10000Hz  noise,  80dBA. 
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Figure  86.  10000-15000Hz  noise,  overload. 
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Figure  88.  10000-15000Hz  noise,  80dBA. 
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Figure  90.  15000-20000Hz  noise,  75dBA. 
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Figure  91.  15000-20000Hz  noise,  70dBA. 
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Appendix  D:  Plots  of  Time-delay  Estimates 

Figures  92-99  show  time-delay  estimate  plots  using  the  three  time-delay  estimation 
methods  being  cross-correlation,  GCC-PHAT  and  ED  methods. 

The  true  time-delays  were  calculated  in  metres  from  the  source  and  microphone 
coordinates  and  are  drawn  as  blue  lines.  The  estimated  time-delays  were  calculated  in 
samples  and  converted  to  distance  in  metres  by  dividing  by  the  sampling  rate  (44100  Hz) 
and  multiplying  by  the  speed  of  sound  in  air  at  20°C  (345.1  ms1).  These  estimates  are 
plotted  in  red  dots.  The  'index'  value  on  the  y-axis  of  plots  represent  which  sample  block 
the  estimate  came  from  after  the  large  samples  were  broken  into  smaller  samples,  e.g.  an 
estimate  with  index  of  10  means  that  estimate  came  from  the  10th  small  sample  block  of  the 
large  sample  (see  Section  7  for  explanation). 
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92.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'overload'  white  noise.  The  red  plots  represent  the  estimates;  the  blue  lines  are  the  true 
time-delays. 
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Figure  93.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'90dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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Figure  94.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'80dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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Figure  95.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'70dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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Figure  96.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'overload'  white  noise.  The  red  plots  represent  the  estimates;  the  blue  lines  are  the  true 
time-delays. 
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Figure  97.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'90dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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Figure  98.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'80dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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Figure  99.  Time-delay  estimate  plots  using  cross-correlation,  GCC-PHAT  and  ED  method  for 
'70dBA'  white  noise.  The  red  plots  represent  the  estimates ;  the  blue  lines  are  the  true 
time-delays;  *  indicates  that  the  x-axis  of  that  plot  has  a  different  scale. 
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